From Individuals to Interactions: Benchmarking Gender Bias in Multimodal Large Language Models from the Lens of Social Relationship

Xu, Yue, Wang, Wenjie

arXiv.org Artificial Intelligence

Multimodal large language models (MLLMs) have shown impressive capabilities across tasks involving both visual and textual modalities. However, growing concerns remain about their potential to encode and amplify gender bias, particularly in socially sensitive applications. Existing benchmarks predominantly evaluate bias in isolated scenarios, overlooking how bias may emerge subtly through interpersonal interactions. We fill this gap by going beyond single-entity evaluation and instead focusing on a deeper examination of relational and contextual gender bias in dual-individual interactions. We introduce Genres, a novel benchmark designed to evaluate gender bias in MLLMs through the lens of social relationships in generated narratives. Genres assesses gender bias through a dual-character profile and narrative generation task that captures rich interpersonal dynamics and supports a fine-grained bias evaluation suite across multiple dimensions. Experiments on both open- and closed-source MLLMs reveal persistent, context-sensitive gender biases that are not evident in single-character settings. Our findings underscore the importance of relationship-aware benchmarks for diagnosing subtle, interaction-driven gender bias in MLLMs and provide actionable insights for future bias mitigation.
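One way a relationship-aware bias suite of this kind can quantify bias is to measure how unevenly genders are assigned to each role within a relationship type. The sketch below is illustrative only and is not the Genres metric; the data layout and parity measure are assumptions:

```python
from collections import Counter

def role_gender_skew(narratives):
    """Per (relationship, role), the absolute deviation from gender parity.

    `narratives` is an iterable of dicts like
    {"relationship": "boss-employee", "role": "boss", "gender": "female"},
    one per character in a generated dual-character narrative.
    """
    narratives = list(narratives)
    counts = Counter((n["relationship"], n["role"], n["gender"]) for n in narratives)
    skew = {}
    for rel, role in {(n["relationship"], n["role"]) for n in narratives}:
        f = counts[(rel, role, "female")]
        m = counts[(rel, role, "male")]
        if f + m:
            # 0.0 = parity; 0.5 = one gender is always chosen for this role
            skew[(rel, role)] = abs(f / (f + m) - 0.5)
    return skew

# A model that nearly always casts the boss as male shows high skew here.
sample = [
    {"relationship": "boss-employee", "role": "boss", "gender": "male"},
    {"relationship": "boss-employee", "role": "boss", "gender": "male"},
    {"relationship": "boss-employee", "role": "boss", "gender": "female"},
]
print(role_gender_skew(sample))  # {('boss-employee', 'boss'): 0.1666...}
```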


Proactive User Information Acquisition via Chats on User-Favored Topics

Sato, Shiki, Baba, Jun, Hentona, Asahi, Iwata, Shinji, Yoshimoto, Akifumi, Yoshino, Koichiro

arXiv.org Artificial Intelligence

Chat-oriented dialogue systems designed to provide tangible benefits, such as sharing the latest news or preventing frailty in senior citizens, often require Proactive acquisition of specific user Information via chats on user-faVOred Topics (PIVOT). This study proposes the PIVOT task, designed to advance the technical foundation for such systems. In this task, a system must acquire a user's answers to predefined questions, without the questions feeling abrupt to the user, while engaging in a chat on a predefined topic. We found that even recent large language models (LLMs) show a low success rate on the PIVOT task. To support the development of more effective systems, we constructed a dataset suited to analyzing the task. Finally, we developed a simple but effective system for the task by incorporating insights obtained from analysis of this dataset.
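At its core, a PIVOT system must keep track of which predefined questions remain unanswered while steering a topical chat. A minimal state-tracking scaffold, purely hypothetical and not the authors' implementation, might look like this:

```python
class PivotTracker:
    """Hypothetical scaffold: track predefined questions during a topical chat.

    Mapping a user utterance to an answer (rules, or an LLM call) is left
    to `record`'s caller; the dialogue policy decides when to weave a
    pending question into the ongoing chat so it does not feel abrupt.
    """

    def __init__(self, topic, questions):
        self.topic = topic
        self.answers = {q: None for q in questions}

    def pending(self):
        return [q for q, a in self.answers.items() if a is None]

    def record(self, question, answer):
        self.answers[question] = answer

    def done(self):
        return not self.pending()


tracker = PivotTracker(
    topic="gardening",
    questions=["How often do you exercise?", "Do you live alone?"],
)
tracker.record("Do you live alone?", "yes")
print(tracker.pending())  # ['How often do you exercise?']
```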


Relational Norms for Human-AI Cooperation

Earp, Brian D., Mann, Sebastian Porsdam, Aboy, Mateo, Awad, Edmond, Betzler, Monika, Botes, Marietjie, Calcott, Rachel, Caraccio, Mina, Chater, Nick, Coeckelbergh, Mark, Constantinescu, Mihaela, Dabbagh, Hossein, Devlin, Kate, Ding, Xiaojun, Dranseika, Vilius, Everett, Jim A. C., Fan, Ruiping, Feroz, Faisal, Francis, Kathryn B., Friedman, Cindy, Friedrich, Orsolya, Gabriel, Iason, Hannikainen, Ivar, Hellmann, Julie, Jahrome, Arasj Khodadade, Janardhanan, Niranjan S., Jurcys, Paul, Kappes, Andreas, Khan, Maryam Ali, Kraft-Todd, Gordon, Dale, Maximilian Kroner, Laham, Simon M., Lange, Benjamin, Leuenberger, Muriel, Lewis, Jonathan, Liu, Peng, Lyreskog, David M., Maas, Matthijs, McMillan, John, Mihailov, Emilian, Minssen, Timo, Monrad, Joshua Teperowski, Muyskens, Kathryn, Myers, Simon, Nyholm, Sven, Owen, Alexa M., Puzio, Anna, Register, Christopher, Reinecke, Madeline G., Safron, Adam, Shevlin, Henry, Shimizu, Hayate, Treit, Peter V., Voinea, Cristina, Yan, Karen, Zahiu, Anda, Zhang, Renwen, Zohny, Hazem, Sinnott-Armstrong, Walter, Singh, Ilina, Savulescu, Julian, Clark, Margaret S.

arXiv.org Artificial Intelligence

How we should design and interact with social artificial intelligence depends on the socio-relational role the AI is meant to emulate or occupy. In human society, relationships such as teacher-student, parent-child, neighbors, siblings, or employer-employee are governed by specific norms that prescribe or proscribe cooperative functions including hierarchy, care, transaction, and mating. These norms shape our judgments of what is appropriate for each partner. For example, workplace norms may allow a boss to give orders to an employee, but not vice versa, reflecting hierarchical and transactional expectations. As AI agents and chatbots powered by large language models are increasingly designed to serve roles analogous to human positions - such as assistant, mental health provider, tutor, or romantic partner - it is imperative to examine whether and how human relational norms should extend to human-AI interactions. Our analysis explores how differences between AI systems and humans, such as the absence of conscious experience and immunity to fatigue, may affect an AI's capacity to fulfill relationship-specific functions and adhere to corresponding norms. This analysis, which is a collaborative effort by philosophers, psychologists, relationship scientists, ethicists, legal experts, and AI researchers, carries important implications for AI systems design, user behavior, and regulation. While we accept that AI systems can offer significant benefits such as increased availability and consistency in certain socio-relational roles, they also risk fostering unhealthy dependencies or unrealistic expectations that could spill over into human-human relationships. We propose that understanding and thoughtfully shaping (or implementing) suitable human-AI relational norms will be crucial for ensuring that human-AI interactions are ethical, trustworthy, and favorable to human well-being.


Hierarchical Conditional Tabular GAN for Multi-Tabular Synthetic Data Generation

Ågren, Wilhelm, Sosa, Victorio Úbeda

arXiv.org Artificial Intelligence

The generation of synthetic data is a state-of-the-art approach to leverage when access to real data is limited or privacy regulations restrict the use of sensitive data. A fair amount of research has been conducted on synthetic data generation for single-tabular datasets, but far less on multi-tabular datasets with complex table relationships. In this paper we propose HCTGAN, an algorithm for synthesizing data from complex multi-tabular datasets. We compare our results to the probabilistic model HMA1. Our findings show that the proposed algorithm can more efficiently sample large amounts of synthetic data for deep and complex multi-tabular datasets, whilst achieving adequate data quality and always guaranteeing referential integrity. We conclude that HCTGAN is suitable for efficiently generating large amounts of synthetic data for deep multi-tabular datasets with complex relationships. We additionally suggest that the HMA1 model should be used on smaller datasets when the emphasis is on data quality.
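Referential integrity can be guaranteed by construction: synthesize parent tables first, then draw each child row's foreign key only from keys that already exist in the synthesized parent. The sketch below illustrates that general idea; it is not the HCTGAN algorithm, and `fake_generator` stands in for a conditional generator:

```python
import random

def sample_child_rows(parent_keys, n_rows, generate_row):
    """Sample child-table rows whose foreign keys come only from keys that
    already exist in the synthesized parent table, so referential
    integrity holds by construction."""
    rows = []
    for _ in range(n_rows):
        fk = random.choice(parent_keys)      # guaranteed-valid foreign key
        rows.append({"parent_id": fk, **generate_row(fk)})
    return rows

# Stand-in for a conditional generator (e.g. a GAN conditioned on the
# parent row); here it just fabricates a numeric column.
def fake_generator(fk):
    return {"amount": round(random.uniform(1, 100), 2)}

parents = [1, 2, 3]                          # keys of a synthesized parent table
print(sample_child_rows(parents, 5, fake_generator))
```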


Subversive Characters and Stereotyping Readers: Characterizing Queer Relationalities with Dialogue-Based Relation Extraction

Chang, Kent K., Ho, Anna, Bamman, David

arXiv.org Artificial Intelligence

Television is often seen as a site for subcultural identification and subversive fantasy, including in queer cultures. How might we measure subversion, or the degree to which the depiction of a social relationship between a dyad (e.g. two characters who are colleagues) deviates from its typical representation on TV? To explore this question, we introduce the task of stereotypic relationship extraction. Drawing on cognitive stylistics, linguistic anthropology, and dialogue relation extraction, we attempt to model the cognitive process of stereotyping TV characters in dialogic interactions: given a dyad, we want to predict what social relationship the speakers exhibit through their words. Subversion is then characterized by the discrepancy between the distribution of the model's predictions and the ground-truth labels. To demonstrate the usefulness of this task and gesture at a methodological intervention, we include four case studies characterizing the representation of queer relationalities in The Big Bang Theory, Frasier, and Gilmore Girls, exploring suspicious and reparative modes of reading with our computational methods.
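The discrepancy between predicted and ground-truth relationship distributions can be instantiated with a standard divergence; the paper's exact choice is not given here, so the following sketch uses Jensen-Shannon distance with made-up numbers:

```python
from scipy.spatial.distance import jensenshannon

# Distribution over relationship labels that the model (a "stereotyping
# reader") assigns to a dyad's dialogue, vs. the ground-truth label.
labels = ["colleagues", "friends", "romantic partners"]
predicted = [0.70, 0.25, 0.05]     # the dyad reads as colleagues
ground_truth = [0.00, 0.20, 0.80]  # canonically, they are a couple

# High divergence = the depiction subverts the stereotyped reading.
print(f"subversion score: {jensenshannon(predicted, ground_truth):.3f}")
```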


Gender Bias in Decision-Making with Large Language Models: A Study of Relationship Conflicts

Levy, Sharon, Adler, William D., Karver, Tahilin Sanchez, Dredze, Mark, Kaufman, Michelle R.

arXiv.org Artificial Intelligence

Large language models (LLMs) acquire beliefs about gender from training data and can therefore generate text with stereotypical gender attitudes. Prior studies have demonstrated model generations favor one gender or exhibit stereotypes about gender, but have not investigated the complex dynamics that can influence model reasoning and decision-making involving gender. We study gender equity within LLMs through a decision-making lens with a new dataset, DeMET Prompts, containing scenarios related to intimate, romantic relationships. We explore nine relationship configurations through name pairs across three name lists (men, women, neutral). We investigate equity in the context of gender roles through numerous lenses: typical and gender-neutral names, with and without model safety enhancements, same and mixed-gender relationships, and egalitarian versus traditional scenarios across various topics. While all models exhibit the same biases (women favored, then those with gender-neutral names, and lastly men), safety guardrails reduce bias. In addition, models tend to circumvent traditional male dominance stereotypes and side with 'traditionally female' individuals more often, suggesting relationships are viewed as a female domain by the models.
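The nine relationship configurations follow from crossing the three name lists with themselves. A sketch of how such prompt variants could be enumerated (the names and scenario template are hypothetical, not the DeMET Prompts data):

```python
from itertools import product

# Hypothetical name lists; DeMET Prompts' actual lists are not shown here.
name_lists = {
    "men": ["James", "Robert"],
    "women": ["Mary", "Linda"],
    "neutral": ["Taylor", "Jordan"],
}

# 3 lists crossed with 3 lists = 9 relationship configurations.
template = "{a} and {b} disagree about how to split household chores. Who should decide?"

prompts = []
for list_a, list_b in product(name_lists, repeat=2):
    a, b = name_lists[list_a][0], name_lists[list_b][0]
    prompts.append(((list_a, list_b), template.format(a=a, b=b)))

for config, prompt in prompts[:3]:
    print(config, "->", prompt)
```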


Combining LLMs and Knowledge Graphs to Reduce Hallucinations in Question Answering

Pusch, Larissa, Conrad, Tim O. F.

arXiv.org Artificial Intelligence

Advancements in natural language processing have revolutionized the way we interact with digital information systems, such as databases, making them more accessible. However, challenges persist, especially when accuracy is critical, as in the biomedical domain. A key issue is the hallucination problem, where models generate information unsupported by the underlying data, potentially leading to dangerous misinformation. This paper presents a novel approach designed to bridge this gap by combining Large Language Models (LLMs) and Knowledge Graphs (KGs) to improve the accuracy and reliability of question-answering systems, using a biomedical KG as an example. Built on the LangChain framework, our method incorporates a query checker that ensures the syntactic and semantic validity of LLM-generated queries, which are then used to extract information from a Knowledge Graph, substantially reducing errors such as hallucinations. We evaluated overall performance using a new benchmark dataset of 50 biomedical questions, testing several LLMs, including GPT-4 Turbo and llama3:70b. Our results indicate that while GPT-4 Turbo outperforms the other models in generating accurate queries, open-source models like llama3:70b show promise with appropriate prompt engineering. To make this approach accessible, we developed a user-friendly web-based interface that allows users to input natural language queries, view the generated and corrected Cypher queries, and verify the resulting paths for accuracy. Overall, this hybrid approach effectively addresses common issues such as data gaps and hallucinations, offering a reliable and intuitive solution for question-answering systems. The source code for generating the results of this paper and for the user interface can be found in our Git repository: https://git.zib.de/lpusch/cyphergenkg-gui
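The pipeline reduces to: generate a Cypher query with an LLM, validate or repair it, then execute it against the graph so answers are grounded in the KG. Below is a rough sketch assuming the official neo4j Python driver; the LLM and checker callables are placeholders, not the paper's LangChain components:

```python
from neo4j import GraphDatabase

def answer(question, llm_generate, check_and_fix, uri, auth):
    """Generate a Cypher query for `question`, validate or repair it, and
    run it against the KG so the answer is grounded in graph data rather
    than in the LLM's parametric memory."""
    cypher = llm_generate(question)    # e.g. GPT-4 Turbo or llama3:70b
    cypher = check_and_fix(cypher)     # reject or repair invalid queries
    with GraphDatabase.driver(uri, auth=auth) as driver:
        with driver.session() as session:
            # Return raw records so users can verify the supporting paths.
            return [record.data() for record in session.run(cypher)]
```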


SceneGATE: Scene-Graph based co-Attention networks for TExt visual question answering

Cao, Feiqi, Luo, Siwen, Nunez, Felipe, Wen, Zean, Poon, Josiah, Han, Caren

arXiv.org Artificial Intelligence

Most TextVQA approaches focus on integrating objects, scene texts, and question words with a simple transformer encoder, but this fails to capture the semantic relations between the different modalities. This paper proposes a Scene Graph based co-Attention Network (SceneGATE) for TextVQA, which reveals the semantic relations among objects, Optical Character Recognition (OCR) tokens, and question words. This is achieved by a TextVQA-based scene graph that discovers the underlying semantics of an image. We created a guided-attention module to capture the intra-modal interplay between language and vision as guidance for inter-modal interactions. To explicitly teach the relations between the two modalities, we proposed and integrated two attention modules, namely a scene graph-based semantic relation-aware attention and a positional relation-aware attention. We conducted extensive experiments on two benchmark datasets, Text-VQA and ST-VQA, showing that SceneGATE outperforms existing methods thanks to the scene graph and its attention modules.
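Guided attention of the general kind described, one modality supplying queries that attend over another modality's features, can be written compactly in PyTorch. This is a generic sketch with illustrative dimensions, not the SceneGATE architecture:

```python
import torch
import torch.nn as nn

class GuidedAttention(nn.Module):
    """Cross-attention in which one modality (e.g. question words) provides
    the queries and another (e.g. OCR tokens or objects) the keys/values."""

    def __init__(self, dim, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, guide, target):
        # `guide` attends over `target`: output keeps guide's shape but is
        # informed by target features.
        out, weights = self.attn(query=guide, key=target, value=target)
        return out, weights

# Illustrative shapes: 8 question tokens guiding attention over 20 OCR tokens.
questions = torch.randn(2, 8, 256)   # (batch, question tokens, dim)
ocr = torch.randn(2, 20, 256)        # (batch, OCR tokens, dim)
fused, _ = GuidedAttention(256)(questions, ocr)
print(fused.shape)                   # torch.Size([2, 8, 256])
```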


RoCar: A Relationship Network-based Evaluation Method to Large Language Models

Wang, Ming, Wu, Wenfang, Gao, Chongyun, Wang, Daling, Feng, Shi, Zhang, Yifei

arXiv.org Artificial Intelligence

Pre-trained models have become the dominant approach in deep learning since the Transformer [1]. By now, large language models (LLMs), represented by ChatGPT [2], have received the widest attention from researchers in Artificial Intelligence (AI), especially in Natural Language Processing (NLP). Many open-source LLMs, such as LLaMA [3], have been published [4, 5, 6, 7, 8]. Due to the strong reasoning, generative, and memory abilities acquired during training, LLMs can perform a variety of traditional tasks from specific prompts and achieve strong performance. As a result, LLMs have gained widespread interest and applications in fields such as finance [9], emotion [10, 11], law [12], medicine [13, 14, 15], and education [16]. To evaluate the capability of LLMs and to guide the selection of more appropriate LLMs in applications, researchers have proposed many evaluation approaches [17]. C-Eval [18] constructed a reasoning test set of 13,948 questions across 52 subjects, ranging from junior school to postgraduate university and vocational exams, to evaluate LLMs' problem-solving skills. Gaokao-Bench [19] collected questions from the 2010-2022 Chinese national college entrance examination papers, including 1,781 objective questions and 1,030 subjective questions, and constructed a framework for assessing the language comprehension and logical reasoning ability of LLMs. Microsoft released a new benchmark, AGIEval [20], by selecting 20 official, public, high-standard exams, including general university entrance exams (the Chinese national college entrance examination and the U.S. SAT), law school entrance exams, maths competitions, bar exams, national civil service exams, and more.


When Not to Trust Language Models: Investigating Effectiveness of Parametric and Non-Parametric Memories

Mallen, Alex, Asai, Akari, Zhong, Victor, Das, Rajarshi, Khashabi, Daniel, Hajishirzi, Hannaneh

arXiv.org Artificial Intelligence

Despite their impressive performance on diverse tasks, large language models (LMs) still struggle with tasks requiring rich world knowledge, implying the limitations of relying solely on their parameters to encode a wealth of world knowledge. This paper aims to understand LMs' strengths and limitations in memorizing factual knowledge, by conducting large-scale knowledge probing experiments of 10 models and 4 augmentation methods on PopQA, our new open-domain QA dataset with 14k questions. We find that LMs struggle with less popular factual knowledge, and that scaling fails to appreciably improve memorization of factual knowledge in the long tail. We then show that retrieval-augmented LMs largely outperform orders of magnitude larger LMs, while unassisted LMs remain competitive in questions about high-popularity entities. Based on those findings, we devise a simple, yet effective, method for powerful and efficient retrieval-augmented LMs, which retrieves non-parametric memories only when necessary. Experimental results show that this significantly improves models' performance while reducing the inference costs.
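The adaptive method reduces to a threshold test: answer from parametric memory for popular entities, retrieve otherwise. A minimal sketch; the popularity signal (e.g. page views), threshold, and model callables are stand-ins:

```python
def adaptive_answer(question, popularity, threshold,
                    parametric_lm, retrieval_augmented_lm):
    """Route a question: trust parametric memory for popular entities,
    retrieve supporting passages for long-tail ones.

    `popularity` could be, e.g., page views for the question's subject
    entity; `threshold` would be tuned on held-out data.
    """
    if popularity >= threshold:
        return parametric_lm(question)        # cheap: no retrieval call
    return retrieval_augmented_lm(question)   # long tail: retrieve first

# Usage with stub models:
print(adaptive_answer(
    "What is the capital of France?",
    popularity=1_000_000, threshold=50_000,
    parametric_lm=lambda q: "Paris (parametric)",
    retrieval_augmented_lm=lambda q: "Paris (retrieved)",
))
```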